import numpy as np
import scipy as sp
import scipy.stats
import scipy.special
import matplotlib.pyplot as plt
import math as mt
import seaborn as sns
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf

# plotting style used throughout the notebook
plt.style.use("fivethirtyeight")

Define some colors in this cell. \[ \require{color} \definecolor{red}{RGB}{240,5,5} \definecolor{blue}{RGB}{5,5,240} \definecolor{green}{RGB}{4,240,5} \definecolor{black}{RGB}{0,0,0} \definecolor{dsb}{RGB}{72, 61, 139} \definecolor{Maroon}{RGB}{128,0,0} \]

Linear Regression Model Setting

The common matrix form of linear regression is \[ y = X \beta+u \] where \(u \sim N( {0}, \sigma^2 I)\).

The covariance matrix of the disturbance term \(u\) is \[ \operatorname{var}(u) \equiv\left[\begin{array}{cccc} \operatorname{var}\left(u_{1}\right) & \operatorname{cov}\left(u_{1}, u_{2}\right) & \cdots & \operatorname{cov}\left(u_{1}, u_{N}\right) \\ \operatorname{cov}\left(u_{2}, u_{1}\right) & \operatorname{var}\left(u_{2}\right) & \cdots & \operatorname{cov}\left(u_{2}, u_{N}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}\left(u_{N}, u_{1}\right) & \operatorname{cov}\left(u_{N}, u_{2}\right) & \cdots & \operatorname{var}\left(u_{N}\right) \end{array}\right]=\left[\begin{array}{cccc} h^{-1} & 0 & \cdots & 0 \\ 0 & h^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h^{-1} \end{array}\right] \]

Some derivations make use of \(h = 1/\sigma^2\), the precision. The diagonal form of the covariance matrix encodes two assumptions: no serial correlation and homoscedasticity.

Also, under the assumption that \(X\) is exogenous, we can factor the joint probability density \[ P(y, X \mid \beta, h)=P(y \mid X, \beta, h) P(X) \]

However, because \(X\) does not depend on \(\beta\) or \(h\), we narrow our interest to \[ P(y \mid X, \beta, h) \]

Recall that the multivariate normal distribution takes the form

\[ f(X)=(2 \pi)^{-N / 2}|\Sigma|^{-1 / 2} \exp \left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1}(X-\mu)\right), \quad \Sigma \text{ positive definite} \]

where \(|\Sigma|\) is the determinant of the covariance matrix and \(\Sigma^{-1}\) is its inverse.
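As a quick numerical sanity check (a minimal sketch with made-up numbers, not part of the derivation), we can evaluate the density directly from this formula and compare it against scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 3-dimensional example
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
x = np.array([0.5, 0.8, -0.2])

N = len(x)
quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
pdf_manual = (2 * np.pi) ** (-N / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(pdf_manual, pdf_scipy)  # the two values agree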

Normal-Gamma Conjugacy

In the linear regression context, the determinant term of the covariance matrix becomes \[ |\text{var}(u)|^{-1/2}=\left(\prod_{i=1}^Nh^{-1}\right)^{-1/2} = (h^{-N})^{-1/2}=h^{N/2}=|\Sigma|^{-1/2} \]

The inverse of the covariance matrix is \[ \text{var}(u)^{-1} = (h^{-1}I)^{-1}=hI=\Sigma^{-1} \]

Likelihood Function

With all of the previous settings, the likelihood function \(P(y \mid X, \beta, h)\) simplifies to \[ P(y \mid X, \beta, h)=(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp \left(-\frac{h}{2}(y-\mathrm{X} \beta)^T(y-\mathrm{X} \beta)\right) \]
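To make the formula concrete, here is a small sketch that simulates a toy data set and evaluates this likelihood in log form (the log is used purely for numerical stability; the parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# toy data: y = X beta + u with u ~ N(0, h^{-1} I)
N, k = 50, 3
h_true = 4.0                                   # precision h = 1 / sigma^2
beta_true = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ beta_true + rng.normal(scale=1 / np.sqrt(h_true), size=N)

def log_likelihood(y, X, beta, h):
    """log P(y | X, beta, h) for the homoskedastic linear regression model."""
    resid = y - X @ beta
    N = len(y)
    return -N / 2 * np.log(2 * np.pi) + N / 2 * np.log(h) - h / 2 * (resid @ resid)

print(log_likelihood(y, X, beta_true, h_true))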

Kernel Decomposition

\[ \begin{aligned} ( {y}- {X} {\beta})^{T}( {y}- {X} {\beta})=&(\overbrace{ {y}- {X} \hat{ {\beta}}}^{A}+\overbrace{{X} \hat{{\beta}}- {X} {\beta}}^{B})^{T} (\overbrace{{y}- {X} \hat{ {\beta}}}^{A}+\overbrace{ {X} \hat{ {\beta}}- {X} {\beta}}^{B}) \\ =& A^{T} A+B^{T} B+ A^{T} B+B^TA\\ =& A^{T} A+B^{T} B+2 A^{T} B \\ &\text{(we use the fact that }A^TB=B^TA\text{: both are scalars, and a scalar equals its own transpose)}\\ =&A^{T} A+B^{T} B+2\overbrace{( {y}- {X} \hat{ {\beta}})^{T}( {X} \hat{ {\beta}}- {X} {\beta})}^{0} \\ &\text{(the cross term is zero because the OLS residual is orthogonal to the columns of }X\text{, i.e. }X^T(y-X\hat{\beta})=0)\\ =&({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{\beta}^TX^T-\beta^TX^T)(X\hat{\beta}-X\beta) \\ =&({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) \end{aligned} \]

The kernel decomposition of likelihood function is \[ ( {y}- {X} {\beta})^{T}( {y}- {X} {\beta})=({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) \]
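This identity is easy to verify numerically. The sketch below simulates data, computes the OLS estimate \(\hat\beta=(X^TX)^{-1}X^Ty\), and checks both sides for an arbitrary \(\beta\):

import numpy as np

rng = np.random.default_rng(1)
N, k = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
beta = rng.normal(size=k)                      # an arbitrary coefficient vector

lhs = (y - X @ beta) @ (y - X @ beta)
rhs = (y - X @ beta_hat) @ (y - X @ beta_hat) \
      + (beta_hat - beta) @ X.T @ X @ (beta_hat - beta)
print(np.allclose(lhs, rhs))                   # True: the decomposition holds exactly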

Plugging this back into the likelihood function gives \[ P(y \mid X, \beta, h)=(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp \left[-\frac{h}{2}\left(({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\beta-\hat{\beta})^T {X}^T {X}(\beta-\hat{\beta})\right)\right] \]

Separating the exponential into two factors, \[ p(y \mid \mathrm{X}, \beta, h)=(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp \left[-\frac{h}{2}\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^T \mathrm{X}\left(\beta-\widehat{\beta}\right)\right] \exp \left[-\frac{h}{2}({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})\right] \]

This form of the likelihood function suggests a natural conjugate prior, one that has the same functional form as the likelihood and therefore yields a posterior within the same family of distributions.

Prior

Our prior is a joint distribution \(P(\beta, h)\); to obtain the Normal-Gamma form it proves convenient to factor it as \[ P(\beta, h)=P(\beta \mid h) P(h) \]

This is where the Normal-Gamma distribution enters: the two factors on the right-hand side have the distributions \[ \begin{gathered} \beta \mid h \sim N\left(\mu, (h V)^{-1}\right) \\ h \sim \operatorname{Gamma}(m, v) \end{gathered} \] where the Gamma distribution is parameterized by its mean \(m\) and degrees of freedom \(v\) (see the density below). The advantage of this factorization is that \(\beta\) lives on the real line while \(h\) lives on the positive reals.

Recall the functional form of the Normal-Gamma distribution; the normal component here has precision matrix \(h\Sigma^{-1}\), hence the factor \(h^{N/2}\). \[ f(X, h)=(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}}|\Sigma|^{-1 / 2} \exp \left(-\frac{h}{2}(X-\mu)^T \Sigma^{-1}(X-\mu)\right) \frac{1}{\left(\frac{2 m}{v}\right)^{v / 2}} \frac{1}{\Gamma\left(\frac{v}{2}\right)} h^{\frac{v-2}{2}} \exp \left[-\frac{h v}{2 m}\right] \]

Prior for normal part \(P(\beta\mid h)\) \[ \begin{gathered} P(\beta \mid h)=(2 \pi)^{-\frac{k}{2}} h^{\frac{k}{2}}|V|^{\frac{1}{2}} \exp \left(-\frac{h}{2}(\beta-\mu)^T V(\beta-\mu)\right) \end{gathered} \] where \(k\) is the number of parameters in \(\beta\). We use the determinant rule \(|V^{-1}|=|V|^{-1}\).

Prior for gamma part \(P(h)\) \[ P(h)=\frac{1}{\left(\frac{2 m}{v}\right)^{v / 2}} \frac{1}{\Gamma\left(\frac{v}{2}\right)} h^{\frac{v-2}{2}} \exp \left[-\frac{h v}{2 m}\right] \]
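To sample from this prior we can draw \(h\) first and then \(\beta\) given \(h\). The sketch below assumes the mean/degrees-of-freedom parameterization above, under which \(\operatorname{Gamma}(m, v)\) corresponds to a standard gamma with shape \(v/2\) and scale \(2m/v\); the hyperparameter values are illustrative only:

import numpy as np

rng = np.random.default_rng(2)

k = 3
mu = np.zeros(k)      # prior mean of beta
V = np.eye(k)         # prior precision scale: beta | h ~ N(mu, (h V)^{-1})
m, v = 1.0, 4.0       # prior mean and degrees of freedom of h

n_draws = 1000
# h ~ Gamma(mean m, dof v)  <=>  shape v/2, scale 2m/v
h_draws = rng.gamma(shape=v / 2, scale=2 * m / v, size=n_draws)
# beta | h ~ N(mu, (h V)^{-1})
V_inv = np.linalg.inv(V)
beta_draws = np.array([rng.multivariate_normal(mu, V_inv / h) for h in h_draws])

print(h_draws.mean())             # roughly m
print(beta_draws.mean(axis=0))    # roughly mu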

Posterior

The posterior is formulated by Bayes’ Theorem

\[ P(\beta, h \mid y, X)\propto P(y \mid X, \beta, h)P(\beta\mid h) P(h) \]

\[ P(\beta, h \mid Y, {X}) \propto \ h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left[\left(\beta-\hat{\beta}\right)^T{X}^{T} {X}\left(\beta-\hat{\beta}\right)+(\beta-\mu)^T V(\beta-\mu)\right]\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h}{2}\left(({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+\frac{v}{m}\right)\right] \]

Expand both terms; the cross terms \(2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}\) and \(2 {\mu}^{T} {V} {\beta}\) appear because each cross term is a scalar and therefore equals its own transpose. \[\begin{align} ({\beta}-\hat{\beta})^{T} {X}^{T} {X}({\beta}-\hat{\beta})+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)&= {\beta}^{T} {X}^{T} {X} {\beta}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}-2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}+ {\beta}^{T} V {\beta}+ {\mu}^{T} V {\mu}-2 {\mu}^{T} V {\beta}\\ &=\color{red}{\beta}^{T} {X}^{T} {X} {\beta}\color{black} + \color{red}{\beta}^{T} {V} {\beta} \color{black}-\color{blue}2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}-\color{blue}2 {\mu}^{T} V {\beta}+\color{black} {\mu}^{T} V {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\color{red}\beta^T\underbrace{(X^TX+V)}_{W}\beta\color{black}-\color{blue}2\underbrace{(\hat{\beta}^TX^TX+\mu^TV)}_{w^T}\beta+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\underbrace{\color{red}\beta^TW\beta\color{black}-\color{blue}2w^T\beta}_{\text{needs completing the square}}+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ \end{align}\]

where the following variables are defined to prepare for completing the square \[ W = (X^TX+V)\\ w^T=\hat{\beta}^TX^TX+\mu^TV \]

Details of completing the square \[ \begin{align} \color{red}\beta^TW\beta\color{black}-\color{blue}2w^T\beta\color{black}+\overbrace{\color{green}w^TW^{-1}w-w^TW^{-1}w}^{0} &=\overbrace{\color{red}\beta^TW\beta\color{black}-\color{blue}w^T\beta\color{black}-\color{blue}\beta^Tw\color{black}+\color{green}w^TW^{-1}w}^{\text{complete the square}}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta^TW-w^T)(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta^T-w^TW^{-1})W(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta-W^{-1}w)^TW(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w \end{align} \] We used the fact that \(W\) is symmetric, so \((W^{-1})^T=W^{-1}\).

So the expression we are working with becomes \[ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)=\color{Maroon}\underbrace{\color{Maroon}(\beta-W^{-1}w)^TW(\beta-W^{-1}w)}_{\text{Posterior normal kernel}}\color{black}-\color{green}w^TW^{-1}w+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]

Define the mean of the posterior normal kernel; the \(N\) subscript in \(\mu_N^{\ }\) indicates that it comes from the normal part \[ \begin{align} \mu_{N}^{\ } = W^{-1}w &= W^{-1}(X^TX\hat{\beta}+V^T\mu)\\ &=W^{-1}(X^TX(X^TX)^{-1}X^Ty+V^T\mu)\\ &=W^{-1}\underbrace{(X^Ty+V^T\mu)}_{w} \end{align} \]

Substituting \(\mu_{N}^{\ }\), the expression becomes \[ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)=\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}-\color{green}w^TW^{-1}w+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]
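Computationally, \(\mu_{N}^{\ }\) is just a linear solve. A small sketch with simulated data and illustrative prior hyperparameters \(\mu\) and \(V\):

import numpy as np

rng = np.random.default_rng(3)
N, k = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

mu = np.zeros(k)      # prior mean
V = np.eye(k)         # prior precision scale

W = X.T @ X + V
mu_N = np.linalg.solve(W, X.T @ y + V @ mu)    # V is symmetric, so V^T mu = V mu

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(mu_N)       # posterior mean: a precision-weighted compromise between beta_hat and mu
print(beta_hat)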

So the undecomposed likelihood kernel combined with the \(P(\beta\mid h)\) kernel is \[ \underbrace{({y}-{X} {\beta})^{T}({y}-{X} {\beta})}_{({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) }+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)=({y}-{X} \hat{\beta})^{T}({y}-{X} \hat{\beta})+\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}-\color{green}w^TW^{-1}w+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]

Simplify each term on the right-hand side

\[ \begin{aligned} ( {y}- {X} \hat{ {\beta}})^{T}( {y}- {X} \hat{ {\beta}})&= {y}^{T} {y}+\hat{ {\beta}}^{T} {X}^{T} {X} \hat{ {\beta}} -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+\hat{ {\beta}}^{T} {X}^{T} {X} \overbrace{(X^TX)^{-1}X^Ty}^{\hat{\beta}} -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+\hat{ {\beta}}^{T} X^Ty -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+(\overbrace{(X^TX)^{-1}X^Ty}^{\hat{\beta}})^{T} X^Ty -2 {y}^{T} {X} \hat{ {\beta}} \\ &= y^Ty+y^TX(X^TX)^{-1}X^Ty - 2y^TX\hat{\beta}\\ &= {y}^{T} {y}+{y}^{T} {X} \hat{ {\beta}}-2{y}^{T} {X} \hat{ {\beta}}\\ &={y}^{T} {y}-{y}^{T} {X} \hat{ {\beta}}\\ \end{aligned} \]

\[ \begin{align} w^TW^{-1}w &= (X^Ty+V^T\mu)^T\underbrace{W^{-1}(X^Ty+V^T\mu)}_{\mu_{N}^{\ }}\\ &= (X^Ty+V^T\mu)^T{\mu_{N}^{\ }}\\ &= \mu_N^TW \mu_{N}^{\ } \end{align} \]

\[ \begin{align} \hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}&=\hat{ {\beta}}^T {X}^{T} {X}(X^TX)^{-1}X^Ty\\ &=\hat{ {\beta}}^T X^Ty\\ \end{align} \]

Joining all the terms of the undecomposed likelihood kernel plus the \(P(\beta\mid h)\) kernel, \[ \begin{align} \underbrace{({y}-{X} {\beta})^{T}({y}-{X} {\beta})}_{({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) }+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)&=y^T{y}-{y}^{T} {X} \hat{ {\beta}}+\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel} } \color{black}-\mu_N^T W \mu_{N}^{\ }+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &= \color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}+{y}^{T} {y} -\mu_N^T W \mu_{N}^{\ }+\color{black} {\mu}^{T} {V} {\mu} \end{align} \] where \[ W=X^TX+V\\ \mu_{N}^{\ }=W^{-1}\underbrace{(X^Ty+V^T\mu)}_{w} \]

\[ \begin{align} P(\beta, h \mid {X}, y) \propto&(2 \pi)^{-\frac{N}{2}}(2 \pi)^{-\frac{k}{2}}|V|^{\frac{1}{2}}\; h^{\frac{k}{2}} \exp{\left(-\frac{h}{2}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })\right)}\\ &\times h^{\frac{N}{2}}\exp \left(-\frac{h}{2}({y}^{T} {y} -\mu_N^T W \mu_{N}^{\ }+ {\mu}^{T} {V} {\mu})\right)\\ &\times\frac{1}{\left(\frac{2 m}{v}\right)^{v / 2}} \frac{1}{\Gamma\left(\frac{v}{2}\right)} h^{\frac{v-2}{2}} \exp{\left(-\frac{h v}{2 m}\right)} \end{align} \]

Omit the factors that do not depend on \(\beta\) or \(h\), namely \((2\pi)^{-\frac{N}{2}}\), \((2\pi)^{-\frac{k}{2}}\), \(|V|^{\frac{1}{2}}\) and the gamma normalizer, then combine the powers of \(h\) and the scalar exponents

\[ \begin{align} P(\beta, h \mid {X}, y) \propto&\ h^{\frac{k}{2}} \exp{\left(-\frac{h}{2}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })\right)}\\ &\times h^{\frac{N+v-2}{2}}\exp \left(-\frac{h}{2}\left({y}^{T} {y} -\mu_N^T W \mu_{N}^{\ }+ {\mu}^{T} {V} {\mu}+\frac{v}{m}\right)\right) \end{align} \]

The first factor is the kernel of a normal distribution, \(\beta \mid h, y, X \sim N\left(\mu_{N}^{\ }, (hW)^{-1}\right)\); the second is the kernel of a \(\operatorname{Gamma}(m_N, v_N)\) distribution for \(h\) with degrees of freedom \(v_N = N+v\) and mean \(m_N\) defined through \(\frac{v_N}{m_N}=\frac{v}{m}+{y}^{T}{y}+{\mu}^{T}V{\mu}-\mu_N^TW\mu_N\). The posterior is therefore again of Normal-Gamma form, which is exactly what conjugacy promises.

For reference, this is how the expression we completed the square on arises directly from the product of likelihood and prior: ignore the constant parts such as \(2\pi\) and the gamma normalizer, write \(\hat{u} = y - X\hat{\beta}\) for the residual vector, and then combine the powers of \(h\).

\[ \begin{align} P(\beta, h \mid Y, \mathrm{X}) &\propto h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^T \mathrm{X}\left(\beta-\widehat{\beta}\right)\right] \exp \left[-\frac{h}{2}\hat{u}^T \hat{u}\right]\\ &\times \exp \left[-\frac{h}{2}(\beta-\mu)^{T} V(\beta-\mu)\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h v}{2 m}\right] \end{align} \]

Joining the exponential terms, \[ \begin{align} P(\beta, h \mid Y, \mathrm{X}) &\propto h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^T \mathrm{X}\left(\beta-\widehat{\beta}\right)-\frac{h}{2}(\beta-\mu)^{T} V(\beta-\mu)\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h}{2}\hat{u}^T \hat{u}-\frac{h v}{2 m}\right]\\ &\propto\ h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left[\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^{T} \mathrm{X}\left(\beta-\widehat{\beta}\right)+(\beta-\mu)^T V(\beta-\mu)\right]\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h}{2}\left(\hat{u}^T \hat{u}+\frac{v}{m}\right)\right] \end{align} \] which is the expression the posterior derivation above started from.
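Putting the Normal-Gamma pieces together, the sketch below computes the posterior hyperparameters \(W\), \(\mu_N\), \(v_N = N+v\) and \(m_N\) (as defined above) on simulated data with illustrative prior values, and then draws directly from the joint posterior:

import numpy as np

rng = np.random.default_rng(4)
N, k = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)

# prior hyperparameters (illustrative values)
mu, V = np.zeros(k), np.eye(k)
m, v = 1.0, 4.0

# posterior normal part
W = X.T @ X + V
mu_N = np.linalg.solve(W, X.T @ y + V @ mu)

# posterior gamma part in the mean/degrees-of-freedom parameterization
s = y @ y + mu @ V @ mu - mu_N @ W @ mu_N      # leftover scalar term
v_N = N + v
m_N = v_N / (v / m + s)

# draw from the joint posterior: h ~ Gamma(m_N, v_N), then beta | h ~ N(mu_N, (h W)^{-1})
W_inv = np.linalg.inv(W)
h_draws = rng.gamma(shape=v_N / 2, scale=2 * m_N / v_N, size=2000)
beta_draws = np.array([rng.multivariate_normal(mu_N, W_inv / h) for h in h_draws])

print(mu_N)                       # analytical posterior mean of beta
print(beta_draws.mean(axis=0))    # Monte Carlo estimate of the same quantity
print(1 / h_draws.mean())         # roughly sigma^2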

Normal-Inverse Gamma Conjugacy

Recall the multivariate normal density function

\[ f(X)=(2 \pi)^{-\frac{N}{2}}|\Sigma|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1}(X-\mu)\right), \quad \Sigma \text{ positive definite} \]

We assume the likelihood \({y} \sim {\mathcal{N}}\left({ {X}\beta}, \sigma^{2} {I}\right)\), which takes exactly the same form as the density above

\[ p\left({y} \mid {X}, {\beta}, \sigma^{2}\right)=(2 \pi)^{-\frac{N}{2}}\left|\sigma^{2} {I}\right|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}\left({y}-{ {X}\beta}\right)^{T}\left(\sigma^{2} {I}\right)^{-1}\left({y}-{ {X}\beta}\right)\right) \]

where \(N\) is the number of observations in the data.

Using the determinant rules \[ \left|\sigma^{2} {I}\right|=\sigma^{2N}\\ \left(\sigma^{2} {I}\right)^{-1}=\sigma^{-2}I \]

the likelihood simplifies to

\[ p\left({y} \mid {X}, {\beta}, \sigma^{2}\right)=\left(2 \pi \sigma^{2}\right)^{-N / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left({y}- {X}{\beta}\right)^{T}\left( {y}- {X}{\beta}\right)\right) \]

The prior is usually decomposed as \(p\left({\beta}, \sigma^{2}\right)=p\left({\beta} \mid \sigma^{2}\right) p\left(\sigma^{2}\right)\), where \({\beta} \mid \sigma^{2} \sim {N}\left({\mu}, \sigma^{2} {\Lambda}^{-1}\right)\) and \(\sigma^{2} \sim \operatorname{InvGamma}\left(\alpha, \beta\right)\)

\[ p\left( {\beta} \mid \sigma^{2}\right)=(2 \pi)^{-\frac{k}{2}}\left|\sigma^{2} {\Lambda}^{-1}\right|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}\left({\beta}-{\mu}\right)^T\left(\sigma^2 {\Lambda}^{-1}\right)^{-1}\left({\beta}-{\mu}\right)\right) \]

where \(k\) is the number of \(\beta\) parameters in the model.

Using the determinant rules \[ |\sigma^2\Lambda^{-1}|^{-\frac{1}{2}}=(\sigma^{2k}|\Lambda^{-1}|)^{-\frac{1}{2}}=(\sigma^{2k}|\Lambda|^{-1})^{-\frac{1}{2}}=\sigma^{-k}|\Lambda|^{\frac{1}{2}}\\ \] we obtain \[ (2 \pi)^{-\frac{k}{2}}\left|\sigma^{2} {\Lambda}^{-1}\right|^{-\frac{1}{2}}=(2\pi\sigma^2)^{-\frac{k}{2}}|\Lambda|^\frac{1}{2} \]

The first part of the prior, \(P\left( {\beta} \mid \sigma^{2}\right)\), simplifies to \[ p\left( {\beta} \mid \sigma^{2}\right)=\left(2 \pi \sigma^{2}\right)^{-\frac{k}{2}}\left|{\Lambda}\right|^{\frac{1}{2}} \exp \left(-\frac{1}{2 \sigma^{2}}\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)\right) \]

Recall the inverse gamma distribution is

\[ f(x ; \alpha, \beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}(1 / x)^{\alpha+1} \exp (-\beta / x) \]

Following this functional form, the second part of the prior, \(P(\sigma^2)\), is \[ P\left(\sigma^{2}\right)=\frac{\beta^{\alpha}}{\Gamma\left(\alpha\right)}\left(\sigma^{2}\right)^{-\left(\alpha+1\right)} \exp \left(-\beta / \sigma^{2}\right) \]
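A sketch of drawing from this prior with illustrative hyperparameter values: \(\sigma^2\) is sampled by inverting a gamma draw (if \(Y \sim \operatorname{Gamma}(\alpha, \text{rate}=\beta)\) then \(1/Y \sim \operatorname{InvGamma}(\alpha, \beta)\)), and \(\beta\) is then drawn given \(\sigma^2\); the variable beta_ig holds the inverse-gamma scale to avoid clashing with the regression coefficients:

import numpy as np

rng = np.random.default_rng(5)

k = 3
mu = np.zeros(k)              # prior mean of beta
Lam = np.eye(k)               # prior precision scale Lambda
alpha, beta_ig = 3.0, 2.0     # inverse-gamma shape and scale

n_draws = 1000
# sigma^2 ~ InvGamma(alpha, beta_ig): invert a Gamma(alpha, rate=beta_ig) draw
sigma2_draws = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta_ig, size=n_draws)
# beta | sigma^2 ~ N(mu, sigma^2 Lambda^{-1})
Lam_inv = np.linalg.inv(Lam)
beta_draws = np.array([rng.multivariate_normal(mu, s2 * Lam_inv) for s2 in sigma2_draws])

print(sigma2_draws.mean())        # roughly beta_ig / (alpha - 1)
print(beta_draws.mean(axis=0))    # roughly mu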

Kernel Decomposition

To move forward, there will be a horrible amount of linear-algebraic manipulation; the first step is the kernel decomposition of the likelihood function

\[ \begin{aligned} ( {y}- {X} {\beta})^{T}( {y}- {X} {\beta})=&(\overbrace{ {y}- {X} \hat{ {\beta}}}^{A}+\overbrace{{X} \hat{{\beta}}- {X} {\beta}}^{B})^{T} (\overbrace{{y}- {X} \hat{ {\beta}}}^{A}+\overbrace{ {X} \hat{ {\beta}}- {X} {\beta}}^{B}) \\ =& A^{T} A+B^{T} B+ A^{T} B+B^TA\\ =& A^{T} A+B^{T} B+2 A^{T} B \\ &\text{(we use the fact that }A^TB=B^TA\text{: both are scalars, and a scalar equals its own transpose)}\\ =&A^{T} A+B^{T} B+2\overbrace{( {y}- {X} \hat{ {\beta}})^{T}( {X} \hat{ {\beta}}- {X} {\beta})}^{0} \\ &\text{(the cross term is zero because the OLS residual is orthogonal to the columns of }X\text{, i.e. }X^T(y-X\hat{\beta})=0)\\ =&({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{\beta}^TX^T-\beta^TX^T)(X\hat{\beta}-X\beta) \\ =&({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) \end{aligned} \]

The kernel decomposition of likelihood function is \[ ( {y}- {X} {\beta})^{T}( {y}- {X} {\beta})=({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) \]

The quadratic form \((\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})\) is created to be combined with the kernel of the prior \(P\left( {\beta} \mid \sigma^{2}\right)\), which is \(\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)\)

Joining the second decomposed term of the likelihood with the prior \(P(\beta\mid\sigma^2)\) amounts to adding the two quadratic forms, because multiplying exponentials adds their exponents \[ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right) \]

Expand both terms; the cross terms \(2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}\) and \(2 {\mu}^{T} {\Lambda} {\beta}\) appear because each cross term is a scalar and therefore equals its own transpose. \[\begin{align} (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)&= {\beta}^{T} {X}^{T} {X} {\beta}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}-2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}+ {\beta}^{T} {\Lambda} {\beta}+ {\mu}^{T} {\Lambda} {\mu}-2 {\mu}^{T} {\Lambda} {\beta}\\ &=\color{red}{\beta}^{T} {X}^{T} {X} {\beta}\color{black} + \color{red}{\beta}^{T} {\Lambda} {\beta} \color{black}-\color{blue}2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}-\color{blue}2 {\mu}^{T} {\Lambda} {\beta}+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\color{red}\beta^T\underbrace{(X^TX+\Lambda)}_{M}\beta\color{black}-\color{blue}2\underbrace{(\hat{\beta}^TX^TX+\mu^T\Lambda)}_{m^T}\beta+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\underbrace{\color{red}\beta^TM\beta\color{black}-\color{blue}2m^T\beta}_{\text{needs completing the square}}+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ \end{align}\]

We have defined \[ M=X^TX+\Lambda\\ m^T = \hat{\beta}^TX^TX+\mu^T\Lambda \]

Continue to complete the square of colored term \[ \begin{align} \color{red}\beta^TM\beta\color{black}-\color{blue}2m^T\beta\color{black}+\overbrace{\color{green}m^TM^{-1}m-m^TM^{-1}m}^{0} &=\overbrace{\color{red}\beta^TM\beta\color{black}-\color{blue}m^T\beta\color{black}-\color{blue}\beta^Tm\color{black}+\color{green}m^TM^{-1}m}^{\text{complete the square}}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta^TM-m^T)(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta^T-m^TM^{-1})M(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta-M^{-1}m)^TM(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m \end{align} \]

The last step above made use of the fact that \((M^{-1})^T=M^{-1}\): by the inverse rule \((M^{-1})^T = (M^{T})^{-1}\), and since \(M\) is symmetric, \((M^{T})^{-1}=M^{-1}\).
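A quick numerical check of the completed square, using a random symmetric positive definite \(M\) and arbitrary vectors (purely illustrative):

import numpy as np

rng = np.random.default_rng(6)
k = 4
A = rng.normal(size=(k, k))
M = A @ A.T + k * np.eye(k)        # random symmetric positive definite matrix
m = rng.normal(size=k)
beta = rng.normal(size=k)

lhs = beta @ M @ beta - 2 * m @ beta
Minv_m = np.linalg.solve(M, m)     # M^{-1} m
rhs = (beta - Minv_m) @ M @ (beta - Minv_m) - m @ Minv_m
print(np.allclose(lhs, rhs))       # True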

So this is the new expression we are working on \[ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)=\color{Maroon}\underbrace{\color{Maroon}(\beta-M^{-1}m)^TM(\beta-M^{-1}m)}_{\text{Posterior normal kernel}}\color{black}\underbrace{-\color{green}m^TM^{-1}m+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}}_{\text{shuffled into the inverse-gamma part}} \]

So far we have obtained the kernel of the normal part of the posterior; the rest will be shuffled into the inverse-gamma part.

Define the mean of the posterior normal kernel; the \(N\) subscript in \(\mu_N^{\ }\) indicates that it comes from the normal part \[ \begin{align} \mu_{N}^{\ } = M^{-1}m &= M^{-1}(X^TX\hat{\beta}+\Lambda^T\mu)\\ &=M^{-1}(X^TX(X^TX)^{-1}X^Ty+\Lambda^T\mu)\\ &=M^{-1}\underbrace{(X^Ty+\Lambda^T\mu)}_{m} \end{align} \]

We defined \(m^T\) above; here we use its transpose, \(m\).

Substituting \(\mu_{N}^{\ }\), the expression becomes \[ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)=\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TM(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}-\color{green}m^TM^{-1}m+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]

To summarize, the undecomposed likelihood kernel combined with the \(P(\beta\mid\sigma^2)\) kernel is \[ \underbrace{({y}-{X} {\beta})^{T}({y}-{X} {\beta})}_{({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) }+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)=({y}-{X} \hat{\beta})^{T}({y}-{X} \hat{\beta})+\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TM(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}-\color{green}m^TM^{-1}m+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]

Simplify each term on the right-hand side

\[ \begin{aligned} ( {y}- {X} \hat{ {\beta}})^{T}( {y}- {X} \hat{ {\beta}})&= {y}^{T} {y}+\hat{ {\beta}}^{T} {X}^{T} {X} \hat{ {\beta}} -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+\hat{ {\beta}}^{T} {X}^{T} {X} \overbrace{(X^TX)^{-1}X^Ty}^{\hat{\beta}} -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+\hat{ {\beta}}^{T} X^Ty -2 {y}^{T} {X} \hat{ {\beta}} \\ &= {y}^{T} {y}+(\overbrace{(X^TX)^{-1}X^Ty}^{\hat{\beta}})^{T} X^Ty -2 {y}^{T} {X} \hat{ {\beta}} \\ &= y^Ty+y^TX(X^TX)^{-1}X^Ty - 2y^TX\hat{\beta}\\ &= {y}^{T} {y}+{y}^{T} {X} \hat{ {\beta}}-2{y}^{T} {X} \hat{ {\beta}}\\ &={y}^{T} {y}-{y}^{T} {X} \hat{ {\beta}}\\ \end{aligned} \]

\[ \begin{align} m^TM^{-1}m &= (X^Ty+\Lambda^T\mu)^T\underbrace{M^{-1}(X^Ty+\Lambda^T\mu)}_{\mu_{N}^{\ }}\\ &= (X^Ty+\Lambda^T\mu)^T{\mu_{N}^{\ }}\\ &= \mu_N^T M \mu_{N}^{\ } \end{align} \]

\[ \begin{align} \hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}&=\hat{ {\beta}}^T {X}^{T} {X}(X^TX)^{-1}X^Ty\\ &=\hat{ {\beta}}^T X^Ty\\ \end{align} \]

Join them

\[ \underbrace{({y}-{X} {\beta})^{T}({y}-{X} {\beta})}_{({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta}) }+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)=y^T{y}-{y}^{T} {X} \hat{ {\beta}}+\color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TM(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel} } \color{black}-\mu_N^T M \mu_{N}^{\ }+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}} \]

The final result of the undecomposed likelihood kernel plus the \(P(\beta\mid \sigma^2)\) kernel is \[ \begin{align} &(y-X\beta)^T(y-X\beta)+(\beta-\mu)^T\Lambda(\beta-\mu)\\ &=(y-X\hat{\beta})^T(y-X\hat{\beta})+(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)\\ &={y}^{T} {y}-{y}^{T} {X} \hat{ {\beta}} + \color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_N)^TM(\beta-\mu_N)}_{\text{Posterior normal kernel}}\color{black}-\mu_N^T M \mu_{N}^{\ }+\color{black} {\mu}^{T} {\Lambda} {\mu}+\underbrace{\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}}_{\hat{ {\beta}}^T {X}^{T} y}\\ &= \color{Maroon}\underbrace{\color{Maroon}(\beta-\mu_{N}^{\ })^TM(\beta-\mu_{N}^{\ })}_{\text{Posterior normal kernel}}\color{black}+{y}^{T} {y} -\mu_N^T M \mu_{N}^{\ }+\color{black} {\mu}^{T} {\Lambda} {\mu} \end{align} \]

where \[ M=(X^TX+\Lambda)\\ \mu_{N}^{\ }=M^{-1}\underbrace{(X^Ty+\Lambda^T\mu)}_{m} \]

We have obtained the normal-part kernel of the posterior, so the remaining terms in the result above can be combined with the inverse gamma prior \(P(\sigma^2)\) to obtain the posterior inverse-gamma kernel.

\[ \begin{align} P(\beta,\sigma^2 |X, y) \propto&\left(2 \pi \sigma^{2}\right)^{-N / 2}\left|\ {\Lambda} \right|^{1 / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\left(\ {\beta}-{\mu}_{N}^{\ }\right)^{T} M\left({\beta}-{\mu}_{N}\right)\right]\right) \\ &\left(2 \pi \sigma^{2}\right)^{-k / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[ {y}^{T} {y}-{\mu}_{N}^{T} M{\mu}_{N}^{\ }+ {\mu}^T {\Lambda} {\mu} \right]\right) \\ &\frac{\beta ^{\alpha}}{\Gamma\left(\alpha \right)}\left(\sigma^{2}\right)^{-(\alpha+1)} \exp \left(-\beta / \sigma^{2}\right) \end{align} \]

Because this is only up to proportionality, we can safely ignore the normalizers in the last two lines \[ \frac{\beta ^{\alpha}}{\Gamma\left(\alpha \right)}\ \text{ and }\ (2\pi)^{-k/2} \]

\[ \begin{align} P(\beta,\sigma^2 |X, y)\propto&\left(2 \pi \sigma^{2}\right)^{-N / 2}\left|\ {\Lambda} \right|^{1 / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\left(\ {\beta}-{\mu}_{N}^{\ }\right)^{T} M\left({\beta}-{\mu}_{N}\right)\right]\right) \\ &(\sigma^{2})^{-k / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[ {y}^{T} {y}-{\mu}_{N}^{T} M{\mu}_{N}^{\ }+ {\mu}^T {\Lambda} {\mu} \right]\right) \\ &\left(\sigma^{2}\right)^{-(\alpha+1)} \exp \left(-\beta / \sigma^{2}\right) \end{align} \]

which lets us regroup the powers of \(\sigma^2\): the factor \((\sigma^{2})^{-k/2}\) is exactly the right power for a \(k\)-dimensional normal kernel with covariance \(\sigma^{2}M^{-1}\), so it stays with the kernel in \(\beta\), while the remaining \((\sigma^{2})^{-N/2}\) coming from the likelihood is absorbed into the inverse-gamma part (the constants \((2\pi)^{-N/2}\) and \(|\Lambda|^{1/2}\) can be dropped)

\[ \begin{align} P(\beta,\sigma^2 |X, y) \propto&\left(\sigma^{2}\right)^{-k / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\left({\beta}-{\mu}_{N}^{\ }\right)^{T} M\left({\beta}-{\mu}_{N}^{\ }\right)\right]\right) \\ &\times\left(\sigma^{2}\right)^{-\left(\alpha+\frac{N}{2}+1\right)} \exp \left\{-\frac{1}{\sigma^{2}}\left[\beta+\frac{1}{2} \left({y}^{T} {y}-{\mu}_{N}^{T} M{\mu}_{N}^{\ }+ {\mu}^T {\Lambda} {\mu} \right)\right]\right\} \end{align} \]

Define \[ \alpha_G^{\ } = \alpha+\frac{N}{2}\\ \beta_G^{\ } = \beta+ \frac{1}{2}(y^Ty + \mu^T\Lambda\mu -\mu^T_{N}M \mu_{N}^{\ }) \]
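These posterior hyperparameters are cheap to compute. A sketch with simulated data and illustrative prior values (mu, Lam, alpha and beta_ig are assumed values, not estimates from any real data set):

import numpy as np

rng = np.random.default_rng(7)
N, k = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=N)

# prior hyperparameters (illustrative values only)
mu, Lam = np.zeros(k), np.eye(k)
alpha, beta_ig = 3.0, 2.0

M = X.T @ X + Lam
mu_N = np.linalg.solve(M, X.T @ y + Lam @ mu)

alpha_G = alpha + N / 2
beta_G = beta_ig + 0.5 * (y @ y + mu @ Lam @ mu - mu_N @ M @ mu_N)
print(mu_N, alpha_G, beta_G)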

The final conjugate posterior result is \[ \begin{align} P(\beta,\sigma^2 |X, y) \propto&\left(\sigma^{2}\right)^{-k / 2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\left({\beta}-{\mu}_{N}^{\ }\right)^{T} M\left({\beta}-{\mu}_{N}^{\ }\right)\right]\right) \\ &\times\left(\sigma^{2}\right)^{-(\alpha_G^{\ }+1)} \exp \left\{-\frac{\beta_G^{\ }}{\sigma^{2}}\right\} \end{align} \]

We have successfully shown that the natural conjugate prior yields a posterior that is again a Normal-Inverse-Gamma distribution.

\[ \begin{gathered} P\left({\beta}, \sigma^{2} \mid {X}, {y}\right) = P\left({\beta} \mid {X}, {y}, \sigma^{2}\right) P\left(\sigma^{2} \mid {X}, {y}\right) \\ \text { where } \\ {\beta} \mid {X}, {y}, \sigma^{2} \sim {N}\left({\mu}_{N}^{\ }, \sigma^{2} M^{-1}\right) \\ \sigma^{2} \mid {y}, {X} \sim \operatorname{InvGamma}\left(\alpha_{G}^{\ }, \beta_{G}^{\ }\right) \end{gathered} \]
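Because the posterior is available in closed form, we can draw from it exactly, with no MCMC. The sketch below recomputes the hyperparameters so the cell runs on its own, then samples \(\sigma^2\) and \(\beta\) and compares the posterior mean of \(\beta\) with the OLS estimate (all data and prior values are illustrative):

import numpy as np

rng = np.random.default_rng(8)
N, k = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=N)

# prior hyperparameters (illustrative values only)
mu, Lam = np.zeros(k), np.eye(k)
alpha, beta_ig = 3.0, 2.0

# posterior hyperparameters
M = X.T @ X + Lam
mu_N = np.linalg.solve(M, X.T @ y + Lam @ mu)
alpha_G = alpha + N / 2
beta_G = beta_ig + 0.5 * (y @ y + mu @ Lam @ mu - mu_N @ M @ mu_N)

# exact posterior draws: sigma^2 ~ InvGamma(alpha_G, beta_G), then beta | sigma^2 ~ N(mu_N, sigma^2 M^{-1})
M_inv = np.linalg.inv(M)
sigma2_draws = 1.0 / rng.gamma(shape=alpha_G, scale=1.0 / beta_G, size=3000)
beta_draws = np.array([rng.multivariate_normal(mu_N, s2 * M_inv) for s2 in sigma2_draws])

print("posterior mean of beta:   ", beta_draws.mean(axis=0))
print("OLS estimate:             ", np.linalg.solve(X.T @ X, X.T @ y))
print("posterior mean of sigma^2:", sigma2_draws.mean())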